arXiv:1507.03176vl [stat.ML] 12Jul2015 


1 


Dependent Indian Buffet Process-based 
Sparse Nonparametric Nonnegative Matrix 

Factorization 

Junyu Xuan, Jie Lu, Senior Member, IEEE, Guangquan Zhang, 

Richard Yi Da Xu, Xiangfeng Luo, Member, IEEE 


Abstract —Nonnegative Matrix Factorization (NMF) aims to factorize a matrix into two optimized nonnegative matrices appro¬ 
priate for the intended applications. The method has been widely used for unsupervised learning tasks, including recommender 
systems (rating matrix of users by items) and document clustering (weighting matrix of papers by keywords). Flowever, traditional 
NMF methods typically assume the number of latent factors (i.e., dimensionality of the loading matrices) to be fixed. This 
assumption makes them inflexible for many applications. In this paper, we propose a nonparametric NMF framework to mitigate 
this issue by using dependent Indian Buffet Processes (dIBP). In a nutshell, we apply a correlation function for the generation 
of two stick weights associated with each pair of columns of loading matrices, while still maintaining their respective marginal 
distribution specified by IBP. As a consequence, the generation of two loading matrices will be column-wise (indirectly) correlated. 
Under this same framework, two classes of correlation function are proposed (1) using Bivariate beta distribution and (2) using 
Copula function. Both methods allow us to adopt our work for various applications by flexibly choosing an appropriate parameter 
settings. Compared with the other state-of-the art approaches in this area, such as using Gaussian Process (GP)-based dIBP, 
our work is seen to be much more flexible in terms of allowing the two corresponding binary matrix columns to have greater 
variations in their non-zero entries. Our experiments on the real-world and synthetic datasets show that three proposed models 
perform well on the document clustering task comparing standard NMF without predefining the dimension for the factor matrices, 
and the Bivariate beta distribution-based and Copula-based models have better flexibility than the GP-based model. 

Index Terms —Nonnegative matrix factorization, Probability graphical model, Bayesian nonparametric learning, Indian Buffet 
Process 
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1 Introduction 

ONNEGATIVE Matrix Factorization (NMF) is a 
renowned tool for factor analysis and unsu¬ 
pervised learning which has been used in many re¬ 
search areas [l]-[4]. In text mining area, for example, 
facforizafion on fhe documenf-ke 5 rword mafrix can 
discover fhe hidden topics from documenfs, and fhese 
topics can assisf documenf clusfering or browsing; In 
fhe recommender system area, facforizafion on fhe 
user-movie mafrix can help find fhe genres of users 
and movies. Based on fhese genres, fhe more accurafe 
recommendafion can be generated. 

Sparse nonnegafive mafrix facforizafion [5] is 
favoured by researchers in several areas due fo ifs 
oufpuf: sparse represenfafion of dafa. Sparse repre- 
senfafion discovers a limited number of componenfs 
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to represenf dafa, which is an imporfanf research 
problem [6]. This sparse represenfafion is generally 
desirable because if can aid human undersfanding 
(e.g., wifh gene expression dafa), reduce compufa- 
fional cosfs and obfain beffer generalizafion in learn¬ 
ing algorifhms. Sparsify is closely relafed fo feafure 
selecfion and cerfain generalizafions in machine learn¬ 
ing algorifhms [7], [8]. 

Alfhough NMF has experienced a boom, wifh sig- 
nificanf developmenf in recenf years, fhe assumpfion 
fhaf number of dimensions of factor mafrices need 
fo be predefined is blocking ifs usage in real-world 
applications. Normally, fhis number (i.e., topic number 
or genre number) is assigned by experfs wifh domain 
knowledge, buf if seems fhaf inaccurafe assignmenf 
will impacf on fhe performance of NMF on such 
applied fasks as documenf clusfering or recommender 
systems. Furfhermore, increase of fhe amounf of dafa 
and complexify of fhe fasks means fhaf even experfs 
are inadequate for fhis job. Therefore, if is more 
reasonable and practical fo aufomafically discover fhe 
topic number from fhe dafa. 

Sfochasfic processes, such as fhe Dirichlef process 
[9] and Indian Buffef Process (IBP) [10], [11], have 
been used fo replace ordinary disfribufions in non- 
paramefric learning [12] for providing priors of in¬ 
finite vectors or mafrices. In fhis confexf, IBP can 
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be seen as the prior of a sparse matrix with an 
infinite number of columns. Therefore, an intuitive 
idea would be to give two factor matrices two IBPs 
priors, thus avoiding the need to predefine the dimen¬ 
sions for the factor matrices. However, the problem is 
how to ensure the two IBPs will generate the same 
dimension number for the two factor matrices if they 
are given to them separately. A close related approach 
was seen to use GP-based dIBP [11]. This model works 
in the following fashion: We let and to be 
the jth corresponding columns of two identical sized 
binary matrices and The method assumes 
that both and share the same stick weight fj,j, 
making the probability of generating the number of 
non-zero entries identical for both Zj and Zj^\ The 
GP is merely used to control the correlations between 
individual entry pairs WITHIN Zj^'^ and Zj^\ by 
thresholding a Gaussian CDF to make the entry to 
be either one or zero. Although this model improves 
upon the traditional NMF, it nonetheless is inade¬ 
quate in many matrix factorisation scenarios as the 
assumption that the total number of non-zero entries 
are distributed identically does not hold universally. 
For example, the number of non-zero factors of a User 
column may be drastically different to the number of 
non-zero factors of a corresponding Item column in a 
recommender system. 

In order to better and more flexibly describe the 
correlations between corresponding IBP columns, we 
instead propose a framework in which the correla¬ 
tion function is used for the generation of the latent 
weights pairs and which are responsible for 
the generation of and respectively. Com¬ 
pared with [11], our work allow both and to 
have much greater flexibility and variations in terms 
of their non-zero entries. To be specific, we propose 
two new dIBPs based on bi-variate beta distribution 
and copula. Instead of correlating two IBPs at the 
binary matrices level, the two proposed dIBPs cor¬ 
relate with the two IBPs at the very beginning (at the 
beta random variable level). This strategy results in 
the implementation of the nonparametric NMF frame¬ 
work being based on new dIBPs with a simple model 
structure. Another advantage of the new dIBPs is that 
the data correlation is directly modeled by the param¬ 
eters of the bi-variate beta distribution and copula. 
Nevertheless, introducing bi-variate beta distribution 
and copula presents a challenge for the model infer¬ 
ence. We have given three designed inference algo¬ 
rithms for three implementations of the nonparamet¬ 
ric NMF framework: GP-based dIBP model, bi-variate 
beta distribution-based dIBP model, and copula-based 
dIBP model. The experiments show that the proposed 
models perform well on the document clustering task 
without explicitly predefining the dimension number 
for the factor matrices, and the models based on new 
dIBPs have better convergence rates than GP-based 


dIBP. 

The contributions of this paper are: 

• Two new dIBPs (i.e., bi-variate beta distribution- 
based and copula-based) with simpler model 
structure and direct correlation modeling capa¬ 
bility are proposed as alternatives to the existing 
GP-based dIBP; 

• Three dIBP-based nonparametric nonnegative 
matrix factorization models are proposed to re¬ 
move the assumption of the traditional NMF 
that the number of factors needs to be fixed in 
advance. 

The rest of this paper is organized as follows. 
Preliminary details of NMF and IBP are briefly intro¬ 
duced in Section 2. Section 3 reviews related work. 
Our dIBP-based nonparametric NMF framework is 
proposed with three implementations in Section 4, 
and Gibbs samplers are designed for the three models 
in Section 5. Section 6 conducts a set of experiments 
on real-world tasks. Lastly, Section 7 concludes the 
study and discusses further work. 

2 Preliminary Knowledge 

2.1 Sparse Nonnegative Matrix Factorization 

Given a nonnegative matrix Y^xn (extended to semi¬ 
nonnegative by [13]), Nonnegative Matrix Factoriza¬ 
tion (NMF) aims to find two matrices A^xk and Xnxk 
to minimize the following cost function, 

J — II Fmxn AjyixkX^^j^ II p T lld-mx/c || 1 T || 2tnx fc || 1 

( 1 ) 

where || • ||f is the Frobenius norm, |j • |j i is the £i norm 
and the elements of A and X are also nonnegative. 
The £i norm in the cost function is used for the 
sparseness constraint. 

NMF is widely used in many different research 
areas. Take the recommender system as an example. 
The input U^xn denotes the ratings given by m users 
on n movies. The A^xk denotes the users' interests, 
and Xnxk denotes the movies' genres. All users and 
movies are projected into the same /c-dimensional 
space by NMF. Based on Amxk and Amxk, a more 
accurate recommendation can be achieved. 

One problem of NMF is to determine k. Nor¬ 
mally, this variable is experimentally adjusted within 
a range. The problem of NMF without predefined k 
is called Nonparametric Nonnegative Matrix Factor¬ 
ization. 

2.2 Indian Buffet Process 

The Indian Buffet Process (IBP) [10], [11] is defined as 
a prior for the binary matrices with an infinite number 
of columns. The graphical model is shown in Fig. 1(a). 
A stick-breaking construction for IBP [14] is proposed 
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TABLE 1 

Notations in this paper 


Symbol 

meaning in this paper 

M 

the row number of Y 

N 

the column number of Y 

K 

the latent factor number 

Y 

data matrix with size M x N 

ym,n 

the element of Y at m row and n column 

A 

factor matrix with size M x K 

X 

factor matrix with size N x K 

^m,k 

the element of A at m row and k column 

^n,k 

the element of X at n row and k column 


mask matrix for A 


mask matrix for X 

^m,k 

the mask binary variable for element of A at m row 
and k column 

'^n,k 

the mask binary variable for element of X at n row 
and k column 

VTT 

loading matrix for A 

VTT 

loading matrix for X 

m,k 

the loading variable for element of A at m row and 
k column 


the loading variable for element of X at n row and 
k column 


k stick weight of IBP for A 

(2) 

k stick weight of IBP for X 

g 

model parameters for Bivariate beta distribution or 
copula 


as follows: 

k 

Uj ~ Beta{a, 1), I'j, Zn,k ^Bernoulli{^k) 

i=i 

( 2 ) 

where Zn,k form a mafrix Znxk, Vj is fhe variable 
wifh a befa disfribufion, and a is fhe paramefer of befa 
disfribufion. /ife is fhe sfick weighf of column k. The 
bigger pfc is, the more 'ones' there are in the column 
k of the binary matrix Z^xk- The final number of K 
is determined by the data and the parameter a of IBP. 

An intuitive idea of applying IBP for Nonparamet- 
ric Nonnegative Matrix Factorization is to separately 
set two IBPs as priors of A and X. One problem with 
this idea is that the number of columns Ki and K 2 
generated by the two IBPs may be different, which 
violates the requirement of NMF. 

3 Related Work 

The related work mainly falls into two categories: one 
concerns research on the Nonnegative Matrix Factor¬ 
ization and the other concerns about the Indian Buffet 
Process. These state-of-the-art researches inspire our 
idea of using a dependent Indian Buffet Process for 
nonparametric nonnegative matrix factorization. 

3.1 Nonparametric Nonnegative Matrix Factoriza¬ 
tion 

Types of research on the Nonnegative Matrix Factor¬ 
ization include supervised or semi-supervised exten¬ 
sion [13], [15], the convergence rate [16], Sparse [17], 
Nonparametric [18] and more. 


There are also some researches on Bayesian exten¬ 
sion of NMF which aim to model the NMF using the 
distributions. In one example, LDA [19] and corre¬ 
lated LDA [20] ideas are transferred to the Bayesian 
parametric NMF and correlated NMF [21]. These ex¬ 
tensions are still parametric, however, which means 
the dimension number still need to be predefined. 

The nonparametric extension of NMF mainly relies 
on the machinery of stochastic processes. The Beta 
process is used as the prior of one factor matrix in 
[18]. The Gamma process is used to generate the coef¬ 
ficients for the combination of corresponding elements 
in factor matrices rather than the prior of the factor 
matrices [22]. Both have been successfully applied 
for music analysis. IBP is used as the prior of one 
factor matrix and another factor matrix is drawn from 
Gaussian distribution; an efficient inference method 
(Power-EP) is proposed for this model [23]. These 
processes can be considered as the extension of the 
latent feature factor model [11]. However, there is 
no work to place the priors for two factor matrices 
simultaneously. 

3.2 IBP and dIBP 

The idea of dependent nonparametric processes was 
first proposed by MacEachern [24]. Seven classes of 
dependence are summarized [25]. The basic IBP is 
proposed in [10], [11], and is actually a marginaliza¬ 
tion of the Beta-Bernoulli process. Its widespread pop¬ 
ularity is due to its power to generate a binary matrix 
with infinite columns. The dIBP was first proposed in 
[26] based on Gaussian process, and can be used for 
the nonparametric NMF after being embedded in our 
proposed framework which is discussed in Section 
4. The Kernel Beta process [27] is another dependent 
beta process model that models the dependence be¬ 
tween data points with different covariants, such as 
the time tags of documents, geographic locations of 
people or GDPs of countries. Hierarchial beta process 
is proposed so that the different beta processes share a 
common base discrete measure [28]. The Phylogenetic 
Indian Buffet Process [29] considers the tree structure 
of the data points, which can be seen as a supervised 
IBP. A coupled IBP [30] is proposed for collaborative 
filtering, which links two IBPs through a factor matrix. 
Therefore, this coupling does not guarantee that the 
factor matrices have the same dimension number 
which is important for the NMF. dimension number 
of two infinite factor matrices . However, there is no 
work on using or constructing dIBP for nonparametric 
NMF. 

4 Nonparametric NMF Framework 

AND THREE IMPLEMENTATIONS 

To set a prior for the infinite matrices A and X in 
NMF and make sure the two matrices have the same 
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(c) Implementations by Bivariate-based or Copula-based dependent Indian Buffet Process 


Fig. 1. Graphical Models for (a) the original Indian Buffet Process, (b) GP-based dependent Indian Buffet Process 
and (c) Bivariate beta distribution-based or Copula-based dependent Indian Buffet Process 


number of columns, we apply the dependent Indian 
Buffet process as the prior for two matrices. 

Nonparametric NMF Framework, the data matrix 
is modeled as, 

Y = A*X'^ = * (V(2) 0 ^(2))T (3) 

where and are two binary matrices, 
and are loading matrices, and © denotes the 
Hadamard product. A dIBP is used as the prior for 
Z^^'^ and Z^'^K The likelihood of nonnegative matrix 
factorization is, 

Om.fc = ~ gamma{l,Ti) 

^n,k = ~ gamma{l, tz) (4) 

ym,n ; Xn,k ^Exp{ym,n] E O'm^k ' “b 

k 

where Exp{) denotes the exponential distribution, the 
selection of the gamma and exponential distributions 


for V is used to guarantee the nonnegativity of A 
and X, and each element of Y satisfies an exponential 
distribution with support [0, -l-oo). The special param¬ 
eters for the distributions are to retain the desired 
expectations of these distributions. For example, the 
expectation of distribution ym,n is J2k ^m,k ■ Xn,k + £• 
The matrices and are two loading matrices. 
Since IBP can only generate two binary matrices ZA) 
and ZA)^ we need VA and VA to approximate the 
data Y. Finally, the {VA q zA^) and (VA^ © zA'>'j 
can be seen as the A and X in Eq. (3). The number 
of columns is not predefined but is learned from the 
data. 

The implementation of nonnegative matrix factor¬ 
ization is only conducted to select a prior for ZA') and 
ZA, In the following subsections, we will introduce 
three implementations of this framework through 
three dIBPs. 
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4.1 Implementation by GP-based dIBP 

The first dIBP is proposed based on Gaussian Process 
(GP) [26]. In this GP-based dIBP model, each stick 
weight Hk is used to generated a different number 
of columns of binary matrices. For the NMF, we can 
use this model as the prior for the matrices A and X. 
The graphical model is shown in Fig. 1(b), and the 
generative process is as follows, 

k 

Vj ~ Beta{a, 1), (5) 

i=i 

where the pfc are IBP sticks as in Eq. (2). This set of 
sticks is shared by two binary matrices through 

5fc~GP(0,Efc) 

= cr^ exp(- J ^ ) 

r(i) = r(2) = 

^nl,k = Hhi^lk < 

^Z^,k = Hhll, < 

where F{-) is the normal cumulative distribution 
function, i5{ } is the indicator function, and is the 
kernel function for the GP. The model details can 
be found in [26]. Since there are only two binary 
matrices, the GP is degenerated to a two-dimensional 
Gaussian distribution, and is a 2 x 2 matrix. 

4.2 Implementation by Bivariate Beta 
Distribution-based dIBP 

There are different ways to link two IBPs. The GP- 
based IBP uses the same set of sticks for different 
binary matrices. Here, we propose another method 
that links the initials of two IBPs, ly. In the original 
IBP, 1 / satisfies a beta distribution with parameter 
{a, 1). Therefore, we use a joint distribution 
with beta distributions Beta{ai, 1) and Beta{a 2 , 1) as 
marginal distributions. Following this idea, the intu¬ 
itive candidate would be Dirichlet distribution. How¬ 
ever, there is a strictly negative relation, = 1, 

between the samples of the Dirichlet distribution, but 
we hope to preserve the freedom of two {iy^^\ iy^‘^'>). 

Instead of the Dirichlet distribution, a bivariate beta 
distribution [31] is adopted. The probability density 
function is defined as, 

a,a-ly&-l(i _ a;)b+c-l(l _ y^a+c-l 

B{a,h,c){l - xyY+^+<= (6) 

s.t., 0<x,y<\, a,b,c>0 

where a, b, c are three parameters of this distribution. 
One merit of this bivariate beta distribution is that 
two marginal distributions are, 

x^Beta{a,c), y ^ Beta{b,c) (7) 


Another metric is that this distribution models pos¬ 
itive correlation between with range [0,1] 

adjusted by the three parameters (a, b, c) compared 
to the Dirichlet distribution. Here, we set c of the 
bivariate beta distribution to 1. The reason is because 
we must ensure that the marginal distribution of each 
^ is a beta distribution with parameter (a, 1). This 
condition is for the distribution of binary matrices 
generated by the model that satisfies the IBPs. Con¬ 
sidering Eq. (7), we give c a fixed value. Even with a 
fixed value for c, the bivariate beta distribution in Eq. 
(6) is still able to model different correlations of two 
variables. Eor example, when a = 2.5 and 6 = 4, the 
correlation between x and y is 0.978; when a = 0.05 
and 6 = 0.1, the correlation between x and y is 0.080. 

With the desired bivariate distribution in hand, we 
build the model as follows, 

^ biBeta{0), 0 : {a,b,c = 1} > 0 

zZk BernoulliiZ^'>), ^ J] 

j=i ( 8 ) 

Bernoulli{y,^i^^), 

i=i 

where biBeta{-) denotes the bivariate beta distribution 
in Eq. (6) and 6 denotes the parameters of the distri¬ 
bution. The graphical model is shown in Eig. 1(c). 

The likelihood function is the same as Eq. (4). 
Although the bivariate beta distribution-based dIBP 
has extended the freedom of the relation between lyG) 
and ly^^'), the relation is restricted to positive relations 
in bivariate beta distribution. We use the copula to 
capture more freedom of relations. 

4.3 Implementation by Copula-based dIBP 

Copula [32] is another alternative for linking two 
variables with given marginal distributions, and is 
used to define a joint distribution for variables with 
known marginal distributions. Here, we select the 
Parlie-Gumbel-Morgenstem (PGM) Copula [33] as an 
example. 

The definition of the PGM Copula is, 

Cp{u,v) =uv +puv{l-u){l-v), pG[-l,l] 

Cp(u, f) = 1 -I- p{2u — l)(2w — 1) 

where 

u = ~ (j/(i))“h fiiyG)) ^ ai ■ 

V = . (j,(2))a2-l 

( 10 ) 

and Cp{u,v) is the cumulative distribution function, 
Cp{u,v) is the probability density function, and p is 
the parameter of the PGM copula, u and v are two 
marginal distributions that are known in advance. Eor 
our dIBP model, these marginal distributions are beta 
distribution with parameters (ui, 1) and ( 02 , !)• 
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Fig. 2. FGM copula density surface with different values of p = {-1,0,1} and the marginal distributions are both 
beta distributions; beta{l, 1) (means that ai = a 2 = 1 in Eq. (9)). 


The correlation or dependence of and is 
modeled or reflected by the value of p. As shown in 
Fig. 2, the different correlation (i.e., positive or nega¬ 
tive) can be captured by the value of p. In particular, 
if p = 0, there is no correlation between two marginal 
distributions. 

FGM-based dIBP is defined by replacing the bivari¬ 
ate beta distribution in Eq. (8) with joint distribution 
defined by the FGM copula, 

= Cp(6'), 

0 : {p S [~lj 1]; Oi > 0, 02 > 0} 

The graphical model is the same as the bivariate beta 
distribution in Fig. 1(c). The Spearman correlation can 
be evaluated by coco = Since the support of p is 
[—1,1]/ the correlation range that can be modelled by 
the FGM Copula is [— 

Comparing the bivariate beta distribution, the ad¬ 
vantage of copulas is that we can easily obtain the 
cumulative distribution function and probability den¬ 
sity function of because we only know 

the exact form of probability density function of the 
bivariate beta distribution but not the exact form of 
the cumulative distribution function. This will impact 
on the model inference, which will be introduced later. 

One advantage of bivariate beta distribution-based 
and copula-based dIBP compared with GP-based 
dIBP that less latent variables are involved. This is 
easily observed through the graphical models in Fig. 
1. More latent variables will slow the convergence of 
the Gibbs sampler. 


method, which uses a relatively big K as the (poten¬ 
tial) maximum number of factors, is widely accepted. 

Note that GP-based dIBP is not our contribution. 
Our contribution is to link the GP-based dIBP with 
the nonnegative matrix factorization likelihood. The 
inference of this model, which is not our contribu¬ 
tion, is given in the Appendix for the self-contained 
purpose. The details can be found in [26]. 

5.1 Update /r 

When updating we need to find the conditional 
distribution of given pfe_i and pk+i, because the 
order of p must be maintained to make the marginal 
distributions of Z satisfy two IBPs. The acceptance 
ratio of the M-H sampler for p(fc) = is, 

A P(^lh(fc)) •P(h(fc)lh(fc+i),h(fc-i)) ^ g(p(fc)) \ 

\ ’p(^lh(fc)) •P(h(fe)lh(fe+i):h(fe-i)) 9(h(fc))/ 

( 12 ) 

where p(Z|p(fc)) is the likelihood of p(/j) to generate 
/cth column of binary matrix Z. 

The p(P(j.)|p(/c+i):h(fe-i)) in Eq. (12) is the condi¬ 
tional probability density function of p^^,^ within the 
range of [pi^k+i), This will be different when 

different strategies are used to link p|^,j and Next, 
we derive the conditional probability density function 
of p(fe) as follows: 

Eor the first column. 


(13) 


(14) 


bution, p{fi,Z,Y,6\Y). It is difficult to perform pos¬ 
terior inference under infinite mixtures, thus a com- Eor the second column, 
mon work-around solution in nonparametric Bayesian 

learning is to use a truncation method. The truncation p^^^ = p!'^'’ = 


5 Model Inference 

With data Y in hand, the objective of this section 
is to estimate the hidden variables by a properly 
designed MCMC sampler for their posterior distri- 




P2'’ = 1^2'’ 


and 




( 15 ) 
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and 

|JW| 

where is the Jacobian matrix, 

(1 


J« = 


hi 


0 


0 /x«. 


(16) 


(17) 


For the nth column. 


(n) (n) (n-1) 

hi = hi 


(n) (n) (n-1) 

P2 = ^2 h2 


(18) 


and 


|j(n-l)| 


„(") „(") 
n(-P^ _ dh_i 

(n-1) (n-1) 

hi • h2 


where is the Jacobian matrix. 


j(n-l) ^ 


(n-1) 

hi 


(n-1) 

h2 


(19) 


( 20 ) 


To summarize, the conditional density of is 


p{^ 


<") 


_ ^2 ) Ml ^2 ) 

(n-1) ) (n-1) ) („) , j 

Ml 1*2 Ml M2 


( 21 ) 


(n—1) (n —1) 

hi • h2 


(n) (n) 

hi • h2 


The different methods of linking ni wifh ^2 will 
lead fo a differenf join! probabilify densify funcfion 
for fhem. As infroduced in Section 4, we have given 
two specific forms of fheir joinf probabilify in Eq. (6) 
and Eq. (9). 

The proposal disfribufion, q{-), is selecfed as fhe 
producf of fwo independenf fruncafed befa disfribu- 
fions, 

g(pfc) = 1) • betaifi^^^] ^,1) 

~ beta{^, 1), e [Pkli,Pkli\ (22) 

hfc ~ beta{ — , 1), G K+i,Pfc_i] 


5.2 Update Z 

Two binary mafrices, Z : {Z^^\ can be updafed 

separafely, since each elemenf in fwo mafrices satisfies 


a Bernoulli disfribufion wifh fhe following conditional 
probabilities. 


= 1) 

E ^m,l ’ ^n,l “1“ c) 

n I 




( 1 ) 




,.( 1 ) 


+ e) 


= 0) 

a (1 - p^P)Y\_Exp{ym,n; E (^m,l ‘ d^n,l H“ c) 

n I 


OC (1 - P^^^)Y[Exp{ym,n] 


E 

I 


,(l)^(l)^(2) (2) X 


and 


(23) 


Pi^nl = 1) 


O^hfc 


( 2 ) 


Expijjra^n-t E ^m,l ' E c) 


O^hfc 


‘'^'>Y\Exv(v 


+ e) 


= 0) 

,( 2 ) 


0^ (1 “ Pk)W^^piy^,E, E 


Ot (1 - hfc ) n ^Mym,n] 

m I 

(24) 

where e is small positive number. 


5.3 Update V 

Since the prior for V is Gamma distribution and the 
likelihood is exponential distribution, the conditional 
distribution for V is, 

Om,fc) 

(X gam{vl^'*y, Oi) Exp{ym,n\ am,k ■ Xn,k + e) 
n k 

(X gam{v^^\-, Oi) Exp{ym,n] + e) 

n k 

(25) 

and 

P(^'SkS>a:„,fc) 

(X gam{vl^j^; 02) Exp{ym,n] E ‘ d^n,l E c) 

m I 

cx gamiv^^l; 02) Exp{y^y, Y ^n) + 

m I 

(26) 


5.4 Update 9 

In Eig. 1(c), the graphical model has a parameter 
9. Eor the different strategies to link the 

parameters must be different. Therefore, we design 
corresponding update methods for the proposed two 
strategies: bivariate beta distribution and copula. 



Algorithm 1 : Gibbs Sampler for GP-dIBP-NMF 
Input: Y 
Output: A, X 

initialization; 
while i < maxiter do 

/ / latent variables of dIBP 
Update p by Eq. (32); 

Update Z by Eq. (39) and (40); 

Update g by Eq. (34); 

Update h by Eq. (35); 

Update s by Eq. (38); 

/ / latent variables of data of NMF 
Update V by Eq. (25) and (26); 

_ i H—h; 

Select the sample with largest likelihood; 
return A and X; 


5.4.1 Bivariate Beta Distribution 

The parameters of bivariafe befa disfribufion, 9 : 
{a, b}, are given fwo gamma priors. The condifional 
disfribufions are, 

K 

p([a 6]| • ••) oc gam{[a b]-,hp) V) (27) 

k=l 

where hp is fhe h 5 q)er-paramefer of fhe prior for a and 
b. 

5.4.2 Copula 

There are fhree paramefers for each copula, 9 : 
{p,ax,a 2 }. Their condifional disfribufions are, 

K 

pifai 02 ]! • • •) oc gam{\ai 02 ]; hp) n'(h'>.hh (28) 

k^l 

We give the p of the FGM copula a uniform distri¬ 
bution on [—1,1], 

K 

p{p\---)cxWc{pp,Pk^\p) (29) 

fc=i 

Affer infroducing fhe updafe mefhods, we summa¬ 
rize fhe inference for fhe fhree models in Algorifhm 1 
for fhe GP-based dIBP NMP (GP-dIBP-NMP) model, 
Algorifhm 2 for fhe Bivariafe befa disfribufion-based 
dIBP NMP (BB-dIBP-NMP) model, and Algorifhm 
3 for fhe Gopula-based dIBP NMP (G-dIBP-NMP) 
model. 

6 Experiments 

In fhis secfion, we firsf express fhe abilify fo con- 
ducf mafrix facforizafion and hidden factor number 
learning of fhe proposed algorifhms fhrough S 5 mfhefic 
examples, followed by fhe performance of fhe algo¬ 
rifhms on real-world fasks. 


Algorithm 2: Gibbs Sampler for BB-dIBP-NMP 
Input: Y 
Output: A, X 

initialization; 
while i < maxiter do 

/ / latent variables of dIBP 
Update p by Eq. (12); 

Update Z by Eq. (23) and (24); 

Update a and b by Eq. (27); 

/ / latent variables of data of NMF 
Update V by Eq. (25) and (26); 

_ i ~\ —h; 

Select the sample with largest likelihood; 
return A and X; 


Algorithm 3: Gibbs Sampler for G-dIBP-NMP 
Input: Y 
Output: A, X 

initialization; 
while i < maxiter do 

/ / latent variables of dIBP 
Update p by Eq. (12); 

Update Z by Eq. (23) and (24); 

Update CKi and 02 by Eq. (28); 

Update p by Eq. (29); 

/ / latent variables of data of NMF 
Update V by Eq. (25) and (26); 

_ z H—h; 

Select the sample with largest likelihood; 
return A and X; 


6.1 Synthetic dataset 

We randomly generate a matrix I 20 X 30 with elements 
in {0,1}. The proposed algorithms: GP-dIBP-NMP, 
BB-dIBP-NMP and G-dIBP-NMP are used for sparse 
nonnegafive mafrix facforizafion on fhis mafrix. The 
resulfs are shown in Pig. 4, Pig. 5 and Pig. 6, re- 
specfively. In each figure, we have visualized fhe 
generated mafrix and ifs consfrucfed version fhrough 
fhe learned mafrices A and X which are bofh visual¬ 
ized in fhe figure. If can be observed fhaf fhe (dark) 
positions wifh zero values are reconsfrucfed well by 
all fhree algorifhms comparing when fhe original 
Y and fhe reconsfrucfed Y are compared. Anofher 
observafion is fhe sparsify of fhe factor mafrices A 
and X whose number of columns will change wifh fhe 
iferafion. The lasf buf nof leasf observafion is fhe value 
log-likelihoods from fhe fhree algorifhms. A bigger 
log-likelihood value means fhaf fhe model fifs fhe 
dafa beffer. The comparison shows fhaf fhe BB-dIBP- 
NMP has similar performance fo G-dIBP-NMP buf is 
beffer fhan GP-dIBP-NMP. Excepf for log-likelihood, 
fhe convergence rate of GP-dIBP-NMP is fhe worsf 
of fhe fhree models, as can be observed from fhe 
convergence curve in L list in Pig. 4, Pig. 5 and Pig. 
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Fig. 3. Reconstruction error comparisons on syn¬ 
thetic dataset between GP-dIBP-NMF, BB-dIBP-NMF, 
C-dIBP-NMF, and Sparse NMF 



6 . It takes about 200 for BB-dIBP-NMF to converge, 
and 500 for C-dIBP-NMF fo converge, buf GP-dlBP- 
NMF needs more fhan 1500 iferafions fo reach con¬ 
vergence. This is due fo fhe complex model sfrucfure 
of fhe GP-based dIBP. As well as fhe visualizafion, the 
reconstruction error is also quantitatively compared. 
The reconstruction error is defined as 

error = \\Y — AX^Wi (30) 

The comparison is shown in Fig. 3. The solid (lighf 
blue) line in fhe figure shows fhe sparse NMF wifh 
differenf numbers of factors as fhe inpuf. As shown 
in fhe figure, fhe worsf error value is around 25. Since 
fhe proposed algorifhms do nof need to predefine fhe 
factor number, fhere are fhree lines in Fig. 3 fo indicafe 
fhe errors of fhe fhree algorifhms. 

In order fo show fhe flexibilify of our proposed 
dIBP comparing wifh GP-based dIBP, we firsfly have 
designed a mefric fo measure fhis flexibilify by com¬ 
paring the number of non-zero enfries of fwo loading 
mafrices {A and X) from fhe models. As we claimed, 
our proposed model is with more flexibility to allow 
the numbers of non-zero enfries of loading matrices 
more different from each ofher comparing GP-based 
dIBP. Therefore, we have evaluafed fhe mean of fhe 
differences between fhe corresponding columns of A 
and X as ~ I where K is fhe number 

of columns of bofh matrices, is the number of 
non-zero enfries of fc-fh column of A, and n\ ^ is fhe 
number of non-zero entries of fc-fh column of X. We 
have randomly generafed fen mafrices wifh same size: 
20 X 30, and run fhree models on fen mafrices. The 
designed mefric has been evaluafed by on fhe oufpufs 


Fig. 4. Results on synthetic dataset from GP-dlBP- 
NMF 

of models. As shown in Fig. 7, we can see fhaf fhe 
BB-dIBP-NMF and C-dIBP-NMF are with more fluc¬ 
tuations and larger non-zero entry number differences 
comparing wifh GP-dIBP-NMF. If demonsfrafes fhaf 
fhe proposed dIBPs are more flexible fhan GP-based 
dIBP. 

6.2 Real-world Dataset 

In this subsection, we apply the proposed algorithms 
on two real-world tasks: document clustering and rec- 
ommender system. For each task, we will compare the 
proposed algorithms with sparse NMF with different 
numbers of factors. 

6.2.1 Document Clustering 

The real-world datasets used for the document clus¬ 
tering task are: 

• Cora Dataset^ The Cora dataset consists of 2708 
scientific publications classified into one of seven 
classes. Each publication in the dataset is de¬ 
scribed by a 0/1-valued word vector indicating 
the absence/presence of the corresponding word 
from the dictionary. The dictionary consists of 
1433 unique words. 

• Citeseer Dataset The CiteSeer dataset consists 
of 3312 scientific publications. Each publication 

1. http://linqs.cs.umd.edu/projects/projects/lbc/ 
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Fig. 5. Results on synthetic dataset from BB-dlBP- 
NMF. 

in the dataset is also described by a 0/1-valuec 
word vector indicating the absence/presence o 
the corresponding word from the dictionary. Th( 
dictionary consists of 3703 unique words. Th( 
labels of fhese papers are sef as fheir researcl 
areas, such as AI (Arfificial Infelligence), ML (Ma 
chine Learning), Agenfs, DB (Dafabase), IR (In 
formafion Refrieval) and HCI (Human-Compufe: 
Inferacfion). 

The evaluafion mefrics of documenf clusfering arc 
Jaccard Coefficienf (JC), Folkes&Mallows (FM) and F; 
measure (FI). Given a clusfering resulf, 

• a is fhe number of two poinfs fhaf are in fhe same 
clusfer of bofh benchmark resulfs and clusfering 
resulfs; 

• 6 is fhe number of two points that are in the 
same cluster of benchmark results but in differen 
clusters of clustering results; 

• c is the number of two points that are not in the 
same cluster of the two benchmark results but are 
in the same cluster of clustering results. 

and three metrics are computed the equations in Table 
2 (bigger means better). 

When document clustering is selected as the task, 
the output of the algorithms {A is the document-factor 
matrix) is used as the input of the spectral clustering 
algorithm. Since the common clustering algorithm is 
used, the clustering performance of different NMF 
algorithms is only determined by the factors A. The 
clustering results evaluated by the metrics in Table. 


Fig. 6. Results on synthetic dataset from C-dIBP-NMF. 



Trials ID (order-irrelevance) 


Fig. 7. Results on synthetic dataset to show the 
flexibility of the different models. The x-axis denotes 
the trial IDs (order is irrelevant) 


2 on two different datasets are shown in Fig. 8 and 
Fig. 9. In each figure, we run the sparse normegative 
matrix factorization in Eq. 1 with different factor 
numbers (the x-axis). All the algorithms have 1000 it¬ 
erations. The results of the proposed three algorithms 
are also shown in the figure through three horizontal 
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TABLE 2 

Evaluation metrics of document clustering 


Jaccard Coefficient 

a+o+c 

Folkes & Mallows 

/ \ 1/2 
FM= (• -^) 

y a+o a+c j 

FI measure 

_ 2a^ 

2a^ +ac+ab 


lines. The learned factor numbers are 10 (from BB- 
dlBP-NMF), 12 (from C-dIBP-NMF) and 12 (from GP- 
dlBP-NMF) for the Citeseer dataset; 17 (from BB-dlBP- 
NMF), 15 (from C-dIBP-NMF) and 22 (from GP-dlBP- 
NMF) for the Cora dataset. The reason why the hori¬ 
zontal lines are used is that the proposed algorithms 
do not need the factor number as an input, so the clus¬ 
tering results are not dependent of the x-axis (factor 
number). To summarize, without the prior knowledge 
of the number of factors, the proposed algorithms 
achieve relatively good performance on the document 
clustering task. Of the three algorithms, BB-dlBP- 
NMF achieves the best performance. C-dIBP-NMF 
and GP-dIBP-NMF achieve similar performances on 
two datasets only except the FI on the Cora dataset. 

6.2.2 Recommender System 

The dataset for this task is MovieLens lOOK^. This 
data set consists of 100,000 ratings (1-5) from 943 users 
on 1682 movies, and each user has rated at least 20 
movies. In the following experiment, we use a 5-fold 
method for cross validation (20% ratings are kept as 
the test data and the remaining 80% ratings are used 
as the training data). 

When used for the recommender system, the out¬ 
puts of the algorithms A (user-factor matrix) and X 
(movie-factor matrix) reconstruct the rating matrix 
Yr = A* X'^ on the retained test ratings. The quanti¬ 
tative evaluation is 

MAE=\\Yr-Ytest\\l (31) 

Here, we do not normalize MAE since it does not 
impact on the comparison. Note that the sparse NMF 
in Eq. 1 needs to be revised to ignore the test ratings. 
The weighted NMF [34], [35] can be adopted and a 
mask matrix with ones on the training ratings and 
zeros on the test ratings is constructed. Ignoring the 
test ratings can be easily achieved by only considering 
the training ratings in the updates for Z in Eq. 23 and 
Eq. 24 and updates for V in Eq. 25 and Eq. 26. 

The recommendation results are shown in Eig. 10. 
Considering the 5-fold cross test, the mean of the five 
groups is plotted in the figure with a solid line, and 
the standard deviations are shown in the figure by the 
gray area around the mean plot. Three subfigures in 
Eig. 10 denote the comparison of the sparse NME with 
the proposed three algorithms, for which the results 

2. http://grouplens.org/datasets/movielens/ 


are plotted by dashed lines. Similarly, the standard de¬ 
viations are also given by the area around the dashed 
lines. It can be observed that the SNME has better per¬ 
formance than the proposed algorithms only when the 
factor number is smaller than 20. The proposed algo¬ 
rithms outweigh the SNME (smaller MAE values) for 
the remaining choices for the factor number. We want 
to highlight again that the factor number is unknown 
before running SNME. There is a wide possible range 
for this number. Within this wide range, there are only 
approximately 20 choices that give better performance 
than the proposed algorithms. Three algorithms have 
similar performance to one another. The BB-dlBP- 
NME algorithm is the best, with relatively smaller 
MAE, and C-dIBP-NME is the worst. The C-dlBP- 
NME, however, has the smallest variance. The small 
variance means that this algorithm achieves similar 
performance on different test data. In other words, 
an algorithm with small variance will be stable on 
different test data. We can see that SNME is not stable 
compared with the proposed algorithms. 

7 Conclusion and Further Study 

The renowned nonnegative matrix factorization is 
advantageous for many machine learning tasks, but 
the assumption that the dimension of the factors is 
known in advance makes NME impractical for many 
applications. To resolve this issue, we have proposed a 
nonparametric NME framework based on dIBP to re¬ 
move the assumption. Eirst, a model is built by imple¬ 
menting this framework using GP-based dIBP, which 
successfully removes the assumption but suffers from 
larger model complexity and less flexibility. Then, we 
have proposed two new dIBPs through a bi-variate 
beta distribution and a copula. One advantage of the 
models based on new dIBPs is that they have simpler 
model structures than models with GP-based dIBP. 
Lastly, three inference algorithms have been designed 
for the proposed three models, respectively. The ex¬ 
periments on the S 5 mthetic and real-world datasets 
demonstrates the capability of the proposed models 
to perform NMF without predefining the dimension 
number, and the models based on the new dIBPs have 
better convergence rates and more flexibility than the 
model based on GP-based dIBP. 

The bivariate beta distribution-based model and 
copula-based model have achieved comparative per¬ 
formances both on document clustering and recom¬ 
mender system. We here give hints for their usage: 1) 
if the data correlation is within the range [—0.3,0.3], 
the EGM copula-based model will have better flexibil¬ 
ity than the bivariate beta distribution-based model; 
2 ) if the data correlation is outside the range of 
[—0.3,0.3], the bivariate beta distribution-based model 
is more reasonable than the EGM copula-based model. 

One possible future study for this work is the aspect 
of efficiency. Current Gibbs sampling inference is not 
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JC FM FI 





Fig. 8. Document clustering comparisons on Citeseer dataset between GP-dIBP-NMF (gIBP-NMF), BB-dlBP- 
NMF (bIBP-NMF), C-dIBP-NMF (cIBP-NMF), and Sparse NMF (SNMF). 

JC FM FI 





Fig. 9. Document clustering comparisons on Cora dataset between GP-dIBP-NMF (gIBP-NMF), BB-dIBP-NMF 
(bIBP-NMF), C-dIBP-NMF (cIBP-NMF), and Sparse NMF (SNMF). 




Fig. 10. Recommendation comparisons on MovieLens-100K dataset between GP-dIBP-NMF (gIBP-NMF), BB- 
dlBP-NMF (bIBP-NMF), C-dIBP-NMF (cIBP-NMF), and Sparse NMF (SNMF). 
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efficient enough for big data. Our further study will 
focus on the efficiency of the inference of the proposed 
models using the variational inference strategy. 
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Appendix A 

The conditional distributions for the 
GP-based dIBP NMF 


The conditional distributions are: 

Sampling ji 


• • •) cx ^ n (32) 


Nt 


Pk 


t=l 


where 


7 ^ = ( 33 ) 

where T'O is normal cumulative distribution function. 

Sampling g 


Nt 

p(5fc|---) ocA/'(5fc|0,Efc) (34) 

t n 

where N() denotes normal distribution. 
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Sampling h 



(35) 

with support 


f K,k e (- 00 , Pfc] if 4,fe = 1 

l 4 ,fce[hL+oo) if4./c = o 

(36) 

where 

(37) 

Sampling s 


K 

P(s| • • •) oc gamma{s; hs, 1) A/'(gfe|0, E^) 

(38) 
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Sampling Z 


Pi^t]k = 1| • • •) cx 7fc 
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n I 
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n I 


(39) 


and 


P{' 


.(2) 


"n.k 


ocTfeH Exp{y 

m,7 




( 1 ) ( 1 ) ( 2 ) ( 2 ) 


+ e) 


Pi^nl = 0 | • • ■ ) 


- lk)Y[Exp{y„,^n;Yl 


^(1)^(1) .^(2) (2) 


where 7 ^. is same as in Eq. (33). 


e) 

(40) 




Guangquan Zhang is an associate profes¬ 
sor in Faculty of Engineering and Information 
Technology at the University of Technology 
Sydney (UTS), Australia. He has a PhD in 
Applied Mathematics from Curtin University 
of Technology, Australia. He was with the De¬ 
partment of Mathematics, Hebei University, 
China, from 1979 to 1997, as a Lecturer, 
Associate Professor and Professor. His main 
research interests lie in the area of multi¬ 
objective, bilevel and group decision making, 
decision support system tools, fuzzy measure, fuzzy optimization 
and uncertain information processing. He has published four mono¬ 
graphs, four reference books and over 350 papers in refereed jour¬ 
nals and conference proceedings. He has won 6 Australian Research 
Council (ARC) discovery grants and many other research grants. 



Junyu Xuan received the bachelor’s de¬ 
gree in 2008 from China University of Geo¬ 
sciences, Beijing. Currently he is working to¬ 
ward the dual-doctoral degree in both Shang¬ 
hai University and the University of Tech¬ 
nology, Sydney. His main research interests 
include Machine Learning, Complex Network 
and Web Mining. 



Richard Yi Da Xu received the B.Eng. de¬ 
gree in computer engineering from the Uni¬ 
versity of New South Wales, Sydney, NSW, 
Australia, in 2001, and the Ph.D. degree in 
computer sciences from the University of 
Technology at Sydney (UTS), Sydney, NSW, 
Australia, in 2006. He is currently a Senior 
Lecturer with the School of Computing and 
Communications, UTS. His current research 
interests include machine learning, computer 
vision, and statistical data mining. 











15 


Xiangfeng Luo is a professor in the Schooi 
of Computers, Shanghai University, China. 
Currentiy, he is a visiting professor in Purdue 
University. He received the master’s and PhD 
degrees from the Hefei University of Tech- 
noiogy in 2000 and 2003, respectiveiy. He 
was a postdoctorai researcher with the China 
Knowiedge Grid Research Group, institute 

of Computing Technoiogy (iCT), Chinese 

Academy of Sciences (CAS), from 2003 to 
2005. His main research interests inciude 
Web Wisdom, Cognitive informatics, and Text Understanding. He 
has authored or co-authored more than 50 pubiications and his 
pubiications have appeared in IEEE Trans, on Automation Scienoe 
and Engineering, IEEE Trans, on Systems, Man, and Cybernetics- 
Part C, IEEE Trans, on Learning Technology, Concurrency and Com¬ 
putation: Practioe and Experienoe, and New Generation Computing, 
etc. He has served as the Guest Editor of ACM Transactions on 

Intelligent Systems and Technology. Dr. Luo has also served on 

the committees of a number of conferences/workshops, including 
Program Co-chair of ICWL 2010 (Shanghai), WISM 2012 (Chengdu), 
CTUW2011 (Sydney) and more than 40 PC members of conferences 
and workshops. 



